Lab 8

Today’s goals. Learn about:

  1. Regularisation techniques.
  2. Splines.
  3. Generalized additive models.

Workflow

  1. Consider examples related to theoretical concepts.
  2. Use R to solve some basic questions and to understand important aspects.

Let’s start

  1. Download the R code (it is briefly commented).
  2. You will find XXX where I will ask you to code a bit.
  3. Part of the code after each XXX depends on successfully filling in these lines of code.

Regularisation techniques

Penalised likelihood framework

The idea of regularisation techniques (shrinkage methods) is to consider a penalised likelihood framework, aiming to shrink the regression coefficients toward zero.

Here, we will see an example of ridge and LASSO regression. The shrunken coefficients can be obtained as
\[\tilde \beta= {\rm argmin}_{\boldsymbol \beta} \, \bigg( \frac{1}{n} \sum_{i=1}^{n}-l(\eta_i; y_i)+ \lambda \bigg [(1-\alpha)\frac{1}{2}\sum_{j=1}^{p}\beta^2_j + \alpha \sum_{j=1}^{p}|\beta_j| \bigg]\bigg) \]

  • \(\eta_i= \beta_0 + \beta_1x_{i1} + \ldots + \beta_p x_{ip}\) is the \(i\)-th linear predictor.

  • \(-l(\eta_i; y_i)\): \(i\)-th negative log-likelihood contribution, up to an additive constant which does not depend on \(\beta\). In the Gaussian case \(-l(\eta_i; y_i) =\frac{1}{2\sigma^2}(y_i - \eta_i)^2\), with \(\eta_i=\mu_i\); in the binomial case \(-l(\eta_i; y_i) =-y_i\eta_i + \log(1+\exp(\eta_i))\), with \(\eta_i= {\rm logit}(p_i)\).

  • \(\alpha=0 \implies\) Ridge regression.

  • \(\alpha=1 \implies\) Lasso regression.

  • \(0<\alpha<1 \implies\) elastic-net regularisation, a compromise between the two.

  • \(\lambda\): tuning parameter, which controls the strength of the penalisation. It can be fixed or selected by cross-validation.
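As a quick numerical illustration (a sketch, not part of the lab code; the function name enet_penalty is made up here), the penalty term in square brackets can be computed for a given coefficient vector and mixing parameter \(\alpha\):

```r
# Elastic-net penalty term: (1 - alpha) * 0.5 * sum(beta_j^2) + alpha * sum(|beta_j|)
enet_penalty <- function(beta, alpha) {
  (1 - alpha) * 0.5 * sum(beta^2) + alpha * sum(abs(beta))
}

beta <- c(1.5, -0.5, 0)
enet_penalty(beta, alpha = 0)  # ridge case: 0.5 * sum(beta^2) = 1.25
enet_penalty(beta, alpha = 1)  # LASSO case: sum(|beta|) = 2
```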

Regularisation techniques

Penalised likelihood framework

Regularisation techniques allow us to overcome possible problems of multicollinearity and, for the LASSO, also to reduce the complexity of the model by selecting one with a good trade-off between parsimony (the LASSO performs variable selection) and goodness of fit.

There is also a nice Bayesian interpretation for both ridge and LASSO. See An Introduction to Statistical Learning: with Applications in R by James, G., Witten, D., Hastie, T., Tibshirani, R. (2013).

In the following we will use the glmnet package routines to fit ridge and LASSO regression. Ridge regression in R can also be obtained with the function lm.ridge() in the MASS package. See also https://glmnet.stanford.edu/articles/glmnet.html for an exhaustive introduction to the main functionality of the glmnet package and for several applications moving beyond the classical normal linear model.

Regularisation techniques: Prostate data example

Prostate data example

We consider the Prostate dataset (from the deprecated R package lasso2) to explore the relationship between the prostate specific antigen (PSA) and some clinical measures for men who were about to receive a radical prostatectomy.

  • lcavol: log(cancer volume)

  • lweight: log(prostate weight)

  • age: age

  • lbph: log(benign prostatic hyperplasia amount)

  • svi: seminal vesicle invasion

  • lcp: log(capsular penetration)

  • gleason: Gleason score

  • pgg45: percentage Gleason scores 4 or 5

  • lpsa: log(prostate specific antigen) \(\leftarrow\) outcome

Regularisation techniques: Prostate data example

load("Prostate.rda") 
str(Prostate)
'data.frame':   97 obs. of  9 variables:
 $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
 $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
 $ age    : num  50 58 74 58 62 50 64 58 47 63 ...
 $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
 $ svi    : num  0 0 0 0 0 0 0 0 0 0 ...
 $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
 $ gleason: num  6 6 7 6 6 6 6 6 6 6 ...
 $ pgg45  : num  0 0 20 0 0 0 0 0 0 0 ...
 $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
summary(Prostate)
     lcavol           lweight           age             lbph        
 Min.   :-1.3471   Min.   :2.375   Min.   :41.00   Min.   :-1.3863  
 1st Qu.: 0.5128   1st Qu.:3.376   1st Qu.:60.00   1st Qu.:-1.3863  
 Median : 1.4469   Median :3.623   Median :65.00   Median : 0.3001  
 Mean   : 1.3500   Mean   :3.653   Mean   :63.87   Mean   : 0.1004  
 3rd Qu.: 2.1270   3rd Qu.:3.878   3rd Qu.:68.00   3rd Qu.: 1.5581  
 Max.   : 3.8210   Max.   :6.108   Max.   :79.00   Max.   : 2.3263  
      svi              lcp             gleason          pgg45       
 Min.   :0.0000   Min.   :-1.3863   Min.   :6.000   Min.   :  0.00  
 1st Qu.:0.0000   1st Qu.:-1.3863   1st Qu.:6.000   1st Qu.:  0.00  
 Median :0.0000   Median :-0.7985   Median :7.000   Median : 15.00  
 Mean   :0.2165   Mean   :-0.1794   Mean   :6.753   Mean   : 24.38  
 3rd Qu.:0.0000   3rd Qu.: 1.1787   3rd Qu.:7.000   3rd Qu.: 40.00  
 Max.   :1.0000   Max.   : 2.9042   Max.   :9.000   Max.   :100.00  
      lpsa        
 Min.   :-0.4308  
 1st Qu.: 1.7317  
 Median : 2.5915  
 Mean   : 2.4784  
 3rd Qu.: 3.0564  
 Max.   : 5.5829  
Prostate$gleason <- as.numeric(factor(Prostate$gleason, 
                              labels = c("Low", "High","High","High")))

Regularisation techniques: Prostate data example

pairs(Prostate[,-c(5,7)], col = 1 + Prostate$svi, 
      pch = Prostate$gleason,
      main = paste("Prostate data, n = ", nrow(Prostate)))

Regularisation techniques: Prostate data example

par(mfrow = c(1, 2))
plot(lpsa ~ as.factor(gleason), data = Prostate)
plot(lpsa ~ as.factor(svi), data = Prostate)

Regularisation techniques: Prostate data example

Prostate data example

Let’s assume a linear model for the antigen level (lpsa) including all the available covariates \[y_i =\beta_0 + \sum_{j=1}^{p} x_{ij} \beta_j + \varepsilon_i, \quad \varepsilon_i \stackrel{iid}\sim \mathcal{N}(0, \sigma^2)\]

fit.lin <- lm(lpsa ~ ., data = Prostate) 
par(mfrow = c(1,2))
plot(fit.lin, which = c(1, 2))

Regularisation techniques: Prostate data example

fit1s <- summary(fit.lin); fit1s

Call:
lm(formula = lpsa ~ ., data = Prostate)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.75309 -0.32402 -0.01733  0.45958  1.67753 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.531783   0.869559   0.612  0.54241    
lcavol       0.559350   0.088013   6.355 8.94e-09 ***
lweight      0.479882   0.167877   2.859  0.00531 ** 
age         -0.020992   0.011041  -1.901  0.06054 .  
lbph         0.097596   0.058076   1.680  0.09641 .  
svi          0.753119   0.239577   3.144  0.00228 ** 
lcp         -0.111004   0.089934  -1.234  0.22038    
gleason      0.317300   0.209862   1.512  0.13413    
pgg45        0.002725   0.003815   0.714  0.47697    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6997 on 88 degrees of freedom
Multiple R-squared:  0.6632,    Adjusted R-squared:  0.6326 
F-statistic: 21.66 on 8 and 88 DF,  p-value: < 2.2e-16

VIF

Let’s explore the VIF index

library(car) 
vif(fit.lin)
  lcavol  lweight      age     lbph      svi      lcp  gleason    pgg45 
2.109957 1.362938 1.324965 1.391998 1.928886 3.100610 2.012392 2.270252 

Regularisation techniques: Prostate data example

Ridge regression

In order to use the capabilities of glmnet we must organise the data to be passed to the glmnet function, which requires the outcome vector and the model matrix (without intercept) to be provided separately. Note that the estimator takes the form \(\boldsymbol {\hat \beta}_{RIDGE} = (\mathbf X^\top \mathbf X + \lambda \mathbf K)^{-1} \mathbf X^\top \mathbf y\) where \(\mathbf K= {\rm diag} (0,1, \ldots, 1)\), since penalising the intercept is undesired. Note that the ridge estimator is biased but has smaller variance than the LS one.

library(glmnet) 
n <- dim(Prostate)[1] 
p <- dim(Prostate)[2]
X <- as.matrix(Prostate[,1 : 8]) 
y <- Prostate[, p] 
fit_ridge <- glmnet(X, y, alpha = 0)
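As a sanity check of the closed form above, the ridge estimate can be computed directly (a sketch; note that glmnet standardises the covariates and scales the objective differently, so its coefficients will not coincide exactly with these):

```r
# Manual ridge estimate: (X'X + lambda * K)^{-1} X'y, no penalty on the intercept
Xi <- cbind(1, X)                  # model matrix with intercept column
K  <- diag(c(0, rep(1, ncol(X))))  # K = diag(0, 1, ..., 1)
lambda <- 10
beta_ridge <- solve(t(Xi) %*% Xi + lambda * K, t(Xi) %*% y)
round(drop(beta_ridge), 4)
```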

Ridge regression

Then we can print the percent (of null) deviance explained (%dev) and the value of \(\lambda\)

print(fit_ridge)

Call:  glmnet(x = X, y = y, alpha = 0) 

    Df  %Dev Lambda
1    8  0.00 843.40
2    8  0.52 768.50
3    8  0.57 700.20
4    8  0.63 638.00
5    8  0.69 581.30
6    8  0.76 529.70
7    8  0.83 482.60
8    8  0.91 439.80
9    8  1.00 400.70
10   8  1.09 365.10
11   8  1.20 332.70
12   8  1.31 303.10
13   8  1.44 276.20
14   8  1.57 251.60
15   8  1.72 229.30
16   8  1.89 208.90
17   8  2.06 190.40
18   8  2.26 173.50
19   8  2.47 158.00
20   8  2.70 144.00
21   8  2.96 131.20
22   8  3.23 119.60
23   8  3.53 108.90
24   8  3.86  99.26
25   8  4.21  90.44
26   8  4.60  82.40
27   8  5.02  75.08
28   8  5.47  68.41
29   8  5.96  62.34
30   8  6.49  56.80
31   8  7.06  51.75
32   8  7.67  47.15
33   8  8.34  42.97
34   8  9.05  39.15
35   8  9.81  35.67
36   8 10.62  32.50
37   8 11.49  29.61
38   8 12.42  26.98
39   8 13.40  24.59
40   8 14.45  22.40
41   8 15.55  20.41
42   8 16.71  18.60
43   8 17.92  16.95
44   8 19.20  15.44
45   8 20.52  14.07
46   8 21.90  12.82
47   8 23.33  11.68
48   8 24.79  10.64
49   8 26.30   9.70
50   8 27.83   8.84
51   8 29.39   8.05
52   8 30.97   7.34
53   8 32.56   6.68
54   8 34.15   6.09
55   8 35.73   5.55
56   8 37.31   5.06
57   8 38.86   4.61
58   8 40.39   4.20
59   8 41.88   3.83
60   8 43.33   3.48
61   8 44.74   3.17
62   8 46.11   2.89
63   8 47.42   2.64
64   8 48.67   2.40
65   8 49.87   2.19
66   8 51.02   1.99
67   8 52.10   1.82
68   8 53.13   1.66
69   8 54.11   1.51
70   8 55.03   1.38
71   8 55.89   1.25
72   8 56.71   1.14
73   8 57.47   1.04
74   8 58.19   0.95
75   8 58.86   0.86
76   8 59.49   0.79
77   8 60.08   0.72
78   8 60.63   0.65
79   8 61.14   0.60
80   8 61.62   0.54
81   8 62.06   0.49
82   8 62.47   0.45
83   8 62.84   0.41
84   8 63.19   0.37
85   8 63.51   0.34
86   8 63.81   0.31
87   8 64.08   0.28
88   8 64.32   0.26
89   8 64.55   0.23
90   8 64.75   0.21
91   8 64.93   0.19
92   8 65.10   0.18
93   8 65.25   0.16
94   8 65.38   0.15
95   8 65.50   0.13
96   8 65.61   0.12
97   8 65.70   0.11
98   8 65.78   0.10
99   8 65.86   0.09
100  8 65.92   0.08

Regularisation techniques: Prostate data example

Ridge regression

From the fit_ridge object we can select a specific model according to a value of \(\lambda\) (passed via the argument s). For instance, one can select the model corresponding to \(\lambda=0.1\) and explore the coefficients.

round(t(coef(fit_ridge, s = 0.1)), 4)
1 x 9 sparse Matrix of class "dgCMatrix"
      (Intercept) lcavol lweight     age   lbph    svi     lcp gleason pgg45
s=0.1       0.445 0.4781  0.4588 -0.0159 0.0853 0.6627 -0.0372  0.3153 0.002

Regularisation techniques: Prostate data example

Ridge regression

However, it is more informative to visualise the path of each coefficient against the \(\ell_1\)-norm of the whole coefficient vector as \(\lambda\) varies.
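The plotting call is not shown in the text; a sketch, mirroring the call used later for the LASSO fit:

```r
# Coefficient paths against the L1-norm of the whole coefficient vector
# (the default x-axis of plot.glmnet)
plot(fit_ridge, cex.lab = 1.4)
```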

Regularisation techniques: Prostate data example

Ridge regression

In order to select \(\lambda\), glmnet provides the cv.glmnet() routine, which performs cross-validation.

cvfit_ridge <- cv.glmnet(X, y, alpha = 0)
cvfit_ridge

Call:  cv.glmnet(x = X, y = y, alpha = 0) 

Measure: Mean-Squared Error 

    Lambda Index Measure     SE Nonzero
min 0.0843   100  0.5450 0.1002       8
1se 1.2525    71  0.6432 0.1121       8

Regularisation techniques: Prostate data example

Ridge regression

Thus, we can visualise the results of the CV by plotting the MSE against the \(\log(\lambda)\) values, including upper and lower SE curves. The two marked values of \(\lambda\) correspond to the minimum cross-validated MSE and to the largest \(\lambda\) for which the cross-validated error is within one SE of the minimum.

plot(cvfit_ridge, cex.lab = 1.5)

Regularisation techniques: Prostate data example

Ridge regression

Then, the ridge regression coefficients can be obtained for these values of \(\lambda\). For instance:

c(cvfit_ridge$lambda.min, cvfit_ridge$lambda.1se)
[1] 0.08434274 1.25246297
round(t(cbind(coef(fit.lin), coef(cvfit_ridge, s="lambda.min"), 
              coef(cvfit_ridge, s="lambda.1se"))),4)
3 x 9 sparse Matrix of class "dgCMatrix"
           (Intercept) lcavol lweight     age   lbph    svi     lcp gleason
                0.5318 0.5593  0.4799 -0.0210 0.0976 0.7531 -0.1110  0.3173
lambda.min      0.4533 0.4884  0.4624 -0.0165 0.0869 0.6736 -0.0457  0.3163
lambda.1se      0.6706 0.2321  0.2814 -0.0010 0.0454 0.4050  0.0748  0.2410
            pgg45
           0.0027
lambda.min 0.0021
lambda.1se 0.0024

Regularisation techniques: Prostate data example

LASSO regression

Under LASSO regression, coefficients of regressors weakly linked to the outcome are shrunk exactly to zero; thanks to the \(\ell_1\) penalty, the method thus performs variable selection. Note that in this case we cannot express the estimator in closed form: the first-order condition \(2\mathbf X^\top \mathbf X \boldsymbol\beta - 2\mathbf X^\top \mathbf y + \lambda\, {\rm sign}(\boldsymbol\beta) = 0\) (with the sign taken componentwise, and suitably generalised via subgradients at zero) has no explicit solution, so the objective must be optimised numerically. Like the ridge estimator, the LASSO estimator is biased but has smaller variance than the LS estimator.
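To see why the \(\ell_1\) penalty sets coefficients exactly to zero, consider the orthonormal-design case, where the LASSO solution is obtained by soft-thresholding the LS estimates (a toy sketch, not part of the lab code; the exact scaling of \(\lambda\) depends on the parametrisation):

```r
# Soft-thresholding operator: sign(z) * max(|z| - lambda, 0)
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

soft_threshold(c(-2, -0.3, 0.5, 1.8), lambda = 1)
# small coefficients are set exactly to zero, large ones are shrunk toward zero
```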

fit_lasso <- glmnet(X, y, alpha = 1)

Regularisation techniques: Prostate data example

LASSO regression

As before, we can inspect the percentage of (null) deviance explained (%dev) and the values of \(\lambda\), as well as plot the path of each coefficient against the \(\ell_1\)-norm of the whole coefficient vector as \(\lambda\) varies.

plot(fit_lasso, cex.lab = 1.4)

Regularisation techniques: Prostate data example

LASSO regression

So, we can leverage the cross-validation procedure in order to select \(\lambda\).

cvfit_lasso <- cv.glmnet(X,y, alpha = 1)
cvfit_lasso

Call:  cv.glmnet(x = X, y = y, alpha = 1) 

Measure: Mean-Squared Error 

    Lambda Index Measure      SE Nonzero
min 0.0325    36  0.5330 0.05835       7
1se 0.1734    18  0.5862 0.06786       4

Regularisation techniques: Prostate data example

plot(cvfit_lasso, cex.lab = 1.5)

Regularisation techniques: Prostate data example

round(t(coef(cvfit_lasso, s = "lambda.min")), 4)
1 x 9 sparse Matrix of class "dgCMatrix"
           (Intercept) lcavol lweight     age   lbph    svi lcp gleason pgg45
lambda.min      0.2808 0.4998  0.4188 -0.0091 0.0669 0.5906   .  0.2668 1e-04
round(t(coef(cvfit_lasso, s = "lambda.1se")), 4)
1 x 9 sparse Matrix of class "dgCMatrix"
           (Intercept) lcavol lweight age lbph    svi lcp gleason pgg45
lambda.1se      0.8185 0.4636  0.2205   .    . 0.3763   .  0.0898     .

Regularisation techniques: Prostate data example

LASSO regression

In the following table, the estimated coefficients obtained via LS, ridge and LASSO are reported.

9 x 4 sparse Matrix of class "dgCMatrix"
                      LS        RIDGE      LASSOmin   LASSO1se
(Intercept)  0.531783274  0.453282865  0.2807625874 0.81846774
lcavol       0.559349696  0.488414340  0.4998369445 0.46361713
lweight      0.479882072  0.462350073  0.4187942228 0.22048557
age         -0.020992113 -0.016522892 -0.0091160952 .         
lbph         0.097596110  0.086857128  0.0669083784 .         
svi          0.753119464  0.673561826  0.5906239964 0.37629583
lcp         -0.111004126 -0.045663450  .            .         
gleason      0.317299916  0.316307635  0.2668243338 0.08980177
pgg45        0.002724942  0.002090012  0.0001382399 .         

Spline functions


Splines are functions defined as piecewise polynomials of a fixed degree. The connection points of the polynomials are called knots: between two knots, the function \(f(x)\) is a polynomial, and the polynomial pieces meet at the knots, with further constraints imposed to achieve regularity (continuity of the function and of its derivatives). For instance, given \(K\) knots \(\zeta_1, \ldots, \zeta_K\), a cubic (degree = 3) regression spline is expressed as \[f(x;\boldsymbol \beta)= \beta_0 +\beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{k=1}^{K} \beta_{k+3} (x-\zeta_k)^3_{+} \] with \((x-\zeta_k)^3_{+}={\rm max}(0, (x-\zeta_k)^3)\). We call spline basis functions the \(h_j(x) = x^{j-1}\), \(j=1, \ldots, 4\), and \(h_{j+4}(x) = (x-\zeta_j)^3_{+}\), \(j=1, \ldots, K\). However, different basis expansions can be used to define a spline function, and in general \[f(x;\boldsymbol \beta)= \sum_{k=0}^{K+d} \beta_{k} h_k(x) \] where \(d\) is the degree.
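The truncated power basis above can be constructed by hand and passed to lm(); a sketch on simulated data (function and variable names are made up for illustration):

```r
# Cubic regression spline via the truncated power basis
tp_basis <- function(x, knots) {
  cbind(x, x^2, x^3, sapply(knots, function(z) pmax(0, (x - z)^3)))
}

x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)
fit <- lm(y ~ tp_basis(x, knots = c(0.25, 0.5, 0.75)))
length(coef(fit))  # 1 + 3 + K = 7 coefficients (intercept, x, x^2, x^3, 3 knots)
```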

We will adopt B-splines, which are built by means of the Cox–de Boor recursion formula.

Spline functions

Boston data example

We use a naive application on the Boston dataset, available in the MASS package, to compare the use of polynomials and splines. The dataset collects information on housing values in suburbs of Boston; in particular, we are going to model the continuous variable medv (median value of owner-occupied homes in $1000s) using a simple linear model including the variable lstat (lower status of the population, in percent) as a predictor.

library(splines)
library(ggplot2)
library(MASS)
data(Boston) # From MASS package
ggplot(Boston, aes(lstat, medv)) + geom_point() + theme_bw()

Spline functions

Boston data example

As a first step, we fit a linear model and then analyse the residuals.

fit <- lm(medv ~ lstat, data = Boston)
par(mfrow = c(1, 2))
plot(fit, which = c(1, 2))

Spline functions

Boston data example

The residuals clearly show some problems. In previous lab sessions we dealt with transformations of the outcome and/or the covariates. Here, we do not transform the outcome; instead, we compare regressions based on polynomials and on splines. Let’s start by considering a quadratic term for lstat. There are two ways to achieve this (with or without orthogonal polynomials; see the help of poly).

fit.poly2 <- lm(medv ~ lstat + I(lstat^2), data = Boston)
fit.poly2 <- lm(medv ~ poly(lstat, 2, raw = TRUE), data = Boston) 

Spline functions

Boston data example

Then, we consider a polynomial of degree 5 for lstat.

fit.poly5 <- lm(medv ~ poly(lstat, 5, raw = TRUE), data = Boston)
par(mfrow = c(2,2))
plot(fit.poly2,  which = c(1,2))
plot(fit.poly5, which = c(1,2))

Spline functions

Boston data example

Let’s compare the fitted curves.

ggplot(Boston, aes(lstat, medv)) + 
 geom_point()+ theme_bw() + 
 stat_smooth(method = lm, formula = y ~ x, col = "red") +
 stat_smooth(method = lm, formula = y ~ poly(x, 2, raw = TRUE), col = "green") +
 stat_smooth(method = lm, formula = y ~ poly(x, 5, raw = TRUE))

Spline functions

Boston data example

Now we consider cubic regression splines, namely B-splines built by means of bs(). We fix the (internal) knots at the first, second and third quartiles. Note that a similar specification can be obtained by setting the degrees of freedom equal to 6, as an alternative to specifying the knots.

knots <- quantile(Boston$lstat)[2 : 4]
fit.spline <- lm(medv ~ bs(lstat, knots = knots), data = Boston)
summary(fit.spline)

Call:
lm(formula = medv ~ bs(lstat, knots = knots), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.8071  -3.1502  -0.7389   2.1076  26.9529 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 50.628      2.582  19.608  < 2e-16 ***
bs(lstat, knots = knots)1  -13.682      3.886  -3.521  0.00047 ***
bs(lstat, knots = knots)2  -26.684      2.449 -10.894  < 2e-16 ***
bs(lstat, knots = knots)3  -28.416      2.917  -9.743  < 2e-16 ***
bs(lstat, knots = knots)4  -40.092      3.050 -13.144  < 2e-16 ***
bs(lstat, knots = knots)5  -39.718      4.212  -9.431  < 2e-16 ***
bs(lstat, knots = knots)6  -38.484      4.134  -9.308  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.206 on 499 degrees of freedom
Multiple R-squared:  0.6833,    Adjusted R-squared:  0.6795 
F-statistic: 179.5 on 6 and 499 DF,  p-value: < 2.2e-16

Spline functions

fit.spline <- lm(medv ~ bs(lstat, df = 6), data = Boston)
summary(fit.spline)

Call:
lm(formula = medv ~ bs(lstat, df = 6), data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.8071  -3.1502  -0.7389   2.1076  26.9529 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)          50.628      2.582  19.608  < 2e-16 ***
bs(lstat, df = 6)1  -13.682      3.886  -3.521  0.00047 ***
bs(lstat, df = 6)2  -26.684      2.449 -10.894  < 2e-16 ***
bs(lstat, df = 6)3  -28.416      2.917  -9.743  < 2e-16 ***
bs(lstat, df = 6)4  -40.092      3.050 -13.144  < 2e-16 ***
bs(lstat, df = 6)5  -39.718      4.212  -9.431  < 2e-16 ***
bs(lstat, df = 6)6  -38.484      4.134  -9.308  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.206 on 499 degrees of freedom
Multiple R-squared:  0.6833,    Adjusted R-squared:  0.6795 
F-statistic: 179.5 on 6 and 499 DF,  p-value: < 2.2e-16

Spline functions

Boston data example

A graphical comparison is useful. Thus, we can plot the fitted lines comparing

  • Model 1: polynomial of degree = 2 (BLUE)
  • Model 2: cubic B-splines with df = 6 (RED)
  • Model 3: cubic B-splines with df = 100 (GREEN)
ggplot(Boston, aes(lstat, medv)) + 
 geom_point()+ theme_bw() + 
 stat_smooth(method = lm, formula = y ~ poly(x, 2, raw = TRUE) , se = FALSE)+
 stat_smooth(method = lm, formula = y ~ bs(x, df = 6), col = "red", se = FALSE) +
 stat_smooth(method = lm, formula = y ~ bs(x, df = 100), col = "green", se = FALSE ) 

Spline functions

Boston data example

We can analyse the AIC. Clearly, the benefit of using regression splines (w.r.t. selecting a suitable polynomial) vanishes when we overparametrise the model.

c(AIC(fit.poly2), AIC(fit.spline), AIC(lm(medv ~ bs(lstat, df = 100), data = Boston)))
[1] 3170.516 3114.610 3204.101

Spline functions

Boston data example

To conclude, let us consider one knot at the median and plot the fitted lines, comparing three different specifications of the spline degree:

  • Model 1: cubic B-splines (BLUE)
  • Model 2: quadratic B-splines (RED)
  • Model 3: linear B-splines (GREEN)
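The plotting code is not shown in the text; a sketch of how the comparison could be produced (single internal knot at the median, degrees 3, 2 and 1):

```r
knot <- median(Boston$lstat)
ggplot(Boston, aes(lstat, medv)) +
  geom_point() + theme_bw() +
  stat_smooth(method = lm, formula = y ~ bs(x, knots = knot, degree = 3), se = FALSE) +
  stat_smooth(method = lm, formula = y ~ bs(x, knots = knot, degree = 2),
              col = "red", se = FALSE) +
  stat_smooth(method = lm, formula = y ~ bs(x, knots = knot, degree = 1),
              col = "green", se = FALSE)
```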

Spline functions

Boston data example

We can analyse the AIC
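The code producing these AIC values is not shown in the text; a sketch of calls that could yield them, under the model specifications listed above (single knot at the median):

```r
knot <- median(Boston$lstat)
c(AIC(lm(medv ~ bs(lstat, knots = knot, degree = 3), data = Boston)),
  AIC(lm(medv ~ bs(lstat, knots = knot, degree = 2), data = Boston)),
  AIC(lm(medv ~ bs(lstat, knots = knot, degree = 1), data = Boston)))
```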

[1] 2996.732 2857.757 2857.191

Generalized additive models (GAM)


GAM

A generalized additive model (GAM) is a generalized linear model (GLM) in which the linear predictor is given by a user-specified sum of smooth functions of the covariates, plus a conventional parametric component. GAMs can be stated as follows:

  • \(Y_i \sim {\rm EF}(\mu_i, \phi)\), with \(\mu_i=\mathbb{E}(Y_i|\mathbf{x}_i)\) and \(\phi\) a scale parameter.

  • \(\eta_i = g(\mu_i)\): \(i\)-th linear predictor (\(g(\cdot)\) is the link function) \[\eta_i = f(x_{i1}, x_{i2}, \ldots, x_{ip}) = \beta_0 + \sum_{j}^{} f_{j}(x^j_i) + \sum_{k}^{} s_{k}(x^k_i) \;\;\;\] where the \(f_{j}(\cdot)\)’s denote linear effects of \(x^j_i\), while smooth effects of \(x^k_i\) are denoted by \(s_{k}(\cdot)\). Here, we restrict the formulation to univariate effects, although the r.h.s. of the formula above could also account for bivariate smooth effects, smooth factor interactions, etc.

Generalized additive models (GAM)

GAM

By using the (linear) basis spline expansion, the smooth effects can be represented as \[s_{j}(x^j_i) = \sum_{k} b^{j}_k (x^j_i) \alpha_k^{j}\,\,,\] where the \(b^{j}_k\) are spline basis functions of a suitable dimension, while the \(\alpha_k^{j}\) are regression coefficients. The linear components can be expressed in the usual way, \[\sum_{j}f_{j}(x^j_i) = \boldsymbol x^\top_i \boldsymbol \gamma,\] where \(\boldsymbol x^\top_i\) is the \(i\)-th row of the model matrix associated with the parametric components. Furthermore, a sum-to-zero constraint is generally adopted, since the smooth effects are otherwise identifiable only up to an intercept term. Thus, the linear predictor \(\boldsymbol \eta = (\eta_1, \ldots, \eta_n)^\top\) collapses into the following specification \[\boldsymbol \eta = \mathbf{X}\boldsymbol \beta\] where \(\boldsymbol \beta = (\boldsymbol \gamma^\top, \boldsymbol \alpha^\top)^\top\) and \(\mathbf{X}\) is the overall model matrix.

Generalized additive models (GAM)

GAM

GAMs consider a penalized likelihood
\[\tilde \ell(\beta) = \ell(\boldsymbol{\beta})-\lambda R(\boldsymbol{s}) = \ell(\boldsymbol{\beta}) - \frac{1}{2} \boldsymbol \beta^\top \tilde{\bf S}^{\boldsymbol \lambda} \boldsymbol \beta ,\]

where \(\boldsymbol \lambda\) is the vector of smoothing parameters, \(R(\boldsymbol{s})\) is a measure of roughness, and \(\tilde{\bf S}^{\boldsymbol \lambda}\) is a positive semi-definite penalty matrix (padded with zeros). The complexity of the smooth effects is controlled during the fitting through a quadratic penalty on \(\boldsymbol \alpha\) (which corresponds to placing a Gaussian prior on \(\boldsymbol \alpha\)).

  • If \(\lambda_j \rightarrow \infty \Rightarrow\) maximal smoothness: the fitted curve \(s_j(\cdot)\) is a straight line, the effective number of parameters associated with the predictor \(x_j\) is 1, and there is no evidence for using a smooth function.
  • If \(\lambda_j \rightarrow 0 \Rightarrow\) no smoothness: no penalty is applied and the effective number of parameters associated with the predictor \(x_j\) is greater than 1.
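The two limiting behaviours can be illustrated with mgcv by fixing the smoothing parameter through the sp argument (a toy sketch on simulated data, not part of the lab code):

```r
library(mgcv)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.3)

fit_large <- gam(y ~ s(x), sp = 1e6)   # lambda -> infinity: fit close to a straight line
fit_small <- gam(y ~ s(x), sp = 1e-6)  # lambda -> 0: wiggly, nearly unpenalised fit
c(summary(fit_large)$edf, summary(fit_small)$edf)  # edf close to 1 vs. close to k - 1
```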

Generalized additive models (GAM)

GAM

Fitting the model:

  • Original backfitting algorithm by Hastie and Tibshirani (1986 https://www.jstor.org/stable/pdf/2245459.pdf) for additive models and a weighted version of it (penalized iteratively re-weighted least squares) for GAMs
  • Nested iteration scheme (Wood, 2016 https://www.tandfonline.com/doi/full/10.1080/01621459.2016.1180986) : model fitting procedure alternates
    1. Regression coefficient estimation by Newton’s optimisation of the penalised log-likelihood
    2. Smoothing parameter optimisation by maximising the Laplace approximate marginal likelihood (LAML). The latter is usually done via Newton’s method

Generalized additive models (GAM)

Additive model for modelling Ozone data

Let us consider the ozone dataset in the faraway package. This study aims to explore the relationship between atmospheric ozone concentration and meteorology in the Los Angeles Basin in 1976. A number of cases with missing variables have been removed for simplicity. The data frame reports 330 observations on the following 10 variables.

  • O3: Ozone concentration (ppm), at Sandburg AFB.

  • vh: Vandenburg 500 millibar height

  • wind: wind speed (mph)

  • humidity: humidity (percent)

  • temp: Sandburg AFB temperature (Fahrenheit; note that in the original paper temperatures are reported in Celsius)

  • ibh: inversion base height (ft.)

  • dpg: Daggett pressure gradient (mm Hg)

  • ibt: inversion base temperature (Fahrenheit)

  • vis: visibility (miles)

  • doy: day of the year

Generalized additive models (GAM)

Additive model for modelling Ozone data

According to the original study (https://www.jstor.org/stable/pdf/2288473.pdf) we consider the log transformation of the outcome.

library(mgcv)
library(PerformanceAnalytics)
library(faraway)
data(ozone)
ozone$log_O3 <- log(ozone$O3) 
summary(ozone)
       O3              vh            wind           humidity    
 Min.   : 1.00   Min.   :5320   Min.   : 0.000   Min.   :19.00  
 1st Qu.: 5.00   1st Qu.:5690   1st Qu.: 3.000   1st Qu.:47.00  
 Median :10.00   Median :5760   Median : 5.000   Median :64.00  
 Mean   :11.78   Mean   :5750   Mean   : 4.848   Mean   :58.13  
 3rd Qu.:17.00   3rd Qu.:5830   3rd Qu.: 6.000   3rd Qu.:73.00  
 Max.   :38.00   Max.   :5950   Max.   :11.000   Max.   :93.00  
      temp            ibh              dpg              ibt       
 Min.   :25.00   Min.   : 111.0   Min.   :-69.00   Min.   :-25.0  
 1st Qu.:51.00   1st Qu.: 877.5   1st Qu.: -9.00   1st Qu.:107.0  
 Median :62.00   Median :2112.5   Median : 24.00   Median :167.5  
 Mean   :61.75   Mean   :2572.9   Mean   : 17.37   Mean   :161.2  
 3rd Qu.:72.00   3rd Qu.:5000.0   3rd Qu.: 44.75   3rd Qu.:214.0  
 Max.   :93.00   Max.   :5000.0   Max.   :107.00   Max.   :332.0  
      vis             doy            log_O3     
 Min.   :  0.0   Min.   : 33.0   Min.   :0.000  
 1st Qu.: 70.0   1st Qu.:120.2   1st Qu.:1.609  
 Median :120.0   Median :205.5   Median :2.303  
 Mean   :124.5   Mean   :209.4   Mean   :2.213  
 3rd Qu.:150.0   3rd Qu.:301.8   3rd Qu.:2.833  
 Max.   :350.0   Max.   :390.0   Max.   :3.638  

Generalized additive models (GAM)

Additive model for modelling Ozone data

In a first attempt, let us consider only the explanatory variables: temperature, visibility, wind speed and day of the year.

chart.Correlation(ozone[, c("vis", "doy", "vh", "wind", "log_O3")])

Generalized additive models (GAM)

Additive model for modelling Ozone data

Given \(\mu_i=\mathbb{E}(Y_i|\mathbf{x}_i)\), a simple example is: \(\mu_i= \beta_0+\sum_{j}^{}s_j(x^j_{i}), \ i=1,\ldots,n,\) where the dependent variable \(Y_{i} \sim \mathsf{Normal}(\mu_i, \sigma^2)\) and \(s_j\) are smooth functions of the covariates \(x^j\).

Let us start by fitting the model including the temperature, the visibility, the wind speed and the day of the year. By default, the family is family = gaussian(link = identity).

gamfit <- gam(log_O3 ~ s(temp) + s(vis) + s(doy) + s(wind), data = ozone)
summary(gamfit)

Family: gaussian 
Link function: identity 

Formula:
log_O3 ~ s(temp) + s(vis) + s(doy) + s(wind)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.21297    0.02042   108.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df      F  p-value    
s(temp) 4.338  5.365 36.216  < 2e-16 ***
s(vis)  6.439  7.565  8.192  < 2e-16 ***
s(doy)  5.629  6.814 15.965  < 2e-16 ***
s(wind) 2.024  2.587  6.906 0.000515 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.754   Deviance explained = 76.8%
GCV = 0.14614  Scale est. = 0.13754   n = 330

Generalized additive models (GAM)

Additive model for modelling Ozone data

From the summary we can see the distinction between the parametric coefficients table and a summary for the smooth terms. Inferential procedures for parametric model effects are in line with linear model theory, while the inferential procedures (hypothesis tests) for the smooth terms are trickier, being based on chi-square or sums of chi-square distributions. We do not go into details.

By default, each smooth effect is built with ten basis functions; accounting for the identifiability constraint and the common intercept, the number of parameters in the model is \(1+9+9+9+9 = 37\)

length(gamfit$coef)
[1] 37

Generalized additive models (GAM)

Additive model for modelling Ozone data

Analysing the first column of the table (edf), it seems that smooth effects are sensible for all the variables included (moderately so for the wind effect). Looking at the smoothing parameters, the \(\lambda\)s for temp, vis and doy are close to zero, while the \(\lambda_j\) associated with the wind speed is moderately large, suggesting that this effect could likely be modelled as linear.

round(gamfit$sp, 4)
s(temp)  s(vis)  s(doy) s(wind) 
 0.1849  0.0619  0.0760 18.0260 

Generalized additive models (GAM)

Visualisation and diagnostics of a gam model

Visualising the fitted smooth effects \(\hat s(x^j)\): the average (log-) ozone concentration increases with the temperature; it is higher for lower values of visibility and then seems to stabilise; there is a peak around the end of March; and it decreases as wind increases.

par(mfrow=c(1,2))
plot(gamfit, select = 1, scale = FALSE)
plot(gamfit, select = 2, scale = FALSE)

Generalized additive models (GAM)

Visualisation and diagnostics of a gam model

Visualising the remaining fitted smooth effects \(\hat s(x^j)\): there is a peak around the end of March for the day of the year, and the concentration decreases as wind increases.

par(mfrow=c(1,2))
plot(gamfit, select = 3, scale = FALSE)
plot(gamfit, select = 4, scale = FALSE)

Generalized additive models (GAM)

Visualisation and diagnostics of a gam model

Diagnostic plots involve the representation of the smooth function together with the partial residuals, defined as \[\hat\epsilon^{part}_{ij}=\hat s_j(x_{ij})+\hat\epsilon_i^{P},\] where \(\hat\epsilon_i^{P}\) are the Pearson residuals of the model. In this plot we look for non-linearity or wiggly behaviour of the smooth function, and check whether the partial residuals are evenly distributed around it.
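An equivalent, more direct way to build the partial residuals uses predict() with type = "terms", which returns each fitted \(\hat s_j(x_{ij})\) as a column. A sketch, assuming gamfit from the previous slides:

```r
# Each column of 'sterms' holds a centred fitted smooth s_j(x_ij)
sterms <- predict(gamfit, type = "terms")
pres_temp <- sterms[, "s(temp)"] + residuals(gamfit, type = "pearson")
plot(ozone$temp, pres_temp, pch = 19, xlab = "temp", ylab = "partial residuals")
```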

Generalized additive models (GAM)

# Partial residuals for temp, built by hand: Pearson residuals plus the
# fitted contribution of the s(temp) basis (columns 2:10 of the design matrix)
pr_data <- cbind(ozone$temp,
                 residuals(gamfit, type = "pearson") +
                   gam(log_O3 ~ s(temp) + s(vis) + s(doy) + s(wind), data = ozone,
                       fit = FALSE)$X[, 2:10] %*% coef(gamfit)[2:10])
par(mfrow = c(1, 2))
plot(pr_data[, 1], pr_data[, 2], pch = 19)  # hand-made partial residual plot
rug(pr_data[, 1])
plot(gamfit, residuals = TRUE, pch = 19, select = 1, se = FALSE)  # mgcv's version

Exercise

Try to add the missing information to the left plot, so that it matches the right plot.
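One possible solution (a sketch: it reuses pr_data and gamfit from above, and assumes the missing information is the fitted centred smooth drawn by mgcv on the right):

```r
ord <- order(ozone$temp)
sterms <- predict(gamfit, type = "terms")  # fitted centred smooths, one per column
plot(pr_data[, 1], pr_data[, 2], pch = 19, xlab = "temp", ylab = "s(temp)")
lines(ozone$temp[ord], sterms[ord, "s(temp)"], lwd = 2)  # add the smooth curve
rug(ozone$temp)                                          # and the rug, as on the right
```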

Generalized additive models (GAM)

Including smooth and linear effects

From the previous results we can compare several models. The following models are considered:

  • Model 2: linear effect for wind and smooth effects for the others

  • Model 3: linear effect for all the covariates (fitted using the gam function)

  • Model 4 = Model 3, but fitted using the glm function

# Model 2
gamfit2 <- gam(log(O3)~s(temp) + s(vis) + s(doy) + wind, data = ozone)
summary(gamfit2)

Family: gaussian 
Link function: identity 

Formula:
log(O3) ~ s(temp) + s(vis) + s(doy) + wind

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.42921    0.05981  40.617  < 2e-16 ***
wind        -0.04460    0.01159  -3.849 0.000144 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
          edf Ref.df      F p-value    
s(temp) 4.451  5.497 36.059  <2e-16 ***
s(vis)  6.206  7.354  8.042  <2e-16 ***
s(doy)  5.716  6.910 17.019  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.752   Deviance explained = 76.5%
GCV = 0.14722  Scale est. = 0.13902   n = 330

Generalized additive models (GAM)

# Model 3
glmfit_viagam <- gam(log(O3) ~ temp + vis + doy + wind, data = ozone)
summary(glmfit_viagam)

Family: gaussian 
Link function: identity 

Formula:
log(O3) ~ temp + vis + doy + wind

Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.3876365  0.1490553   2.601  0.00973 ** 
temp         0.0391508  0.0018246  21.457  < 2e-16 ***
vis         -0.0017879  0.0003311  -5.399 1.29e-07 ***
doy         -0.0014802  0.0002450  -6.041 4.20e-09 ***
wind        -0.0123246  0.0117254  -1.051  0.29399    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


R-sq.(adj) =  0.666   Deviance explained =   67%
GCV = 0.1901  Scale est. = 0.18722   n = 330
# Model 4
glmfit <- glm(log(O3) ~ temp + vis + doy + wind, data = ozone)
summary(glmfit)

Call:
glm(formula = log(O3) ~ temp + vis + doy + wind, data = ozone)

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.3876365  0.1490553   2.601  0.00973 ** 
temp         0.0391508  0.0018246  21.457  < 2e-16 ***
vis         -0.0017879  0.0003311  -5.399 1.29e-07 ***
doy         -0.0014802  0.0002450  -6.041 4.20e-09 ***
wind        -0.0123246  0.0117254  -1.051  0.29399    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.1872198)

    Null deviance: 184.252  on 329  degrees of freedom
Residual deviance:  60.846  on 325  degrees of freedom
AIC: 390.56

Number of Fisher Scoring iterations: 2

Generalized additive models (GAM)

Including smooth and linear effects

Finally, we can compare the models using the AIC. As expected, the last two models give the same AIC, and it is clear that smooth effects for temperature, visibility and day of the year yield a much better model. A smooth effect for wind provides only a marginal benefit.

cbind(AIC(gamfit, gamfit2, glmfit_viagam, glmfit))
                    df      AIC
gamfit        20.43023 302.6597
gamfit2       19.37318 305.2065
glmfit_viagam  6.00000 390.5554
glmfit         6.00000 390.5554
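As a complement, the same comparison can be run with the BIC, which penalises the extra effective degrees of freedom more heavily (a sketch, assuming the four fitted objects above):

```r
cbind(BIC(gamfit, gamfit2, glmfit_viagam, glmfit))
```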

Generalized additive models (GAM)

Extending the model considering all the available covariates

Let us include all the available information in the linear predictor, starting with smooth effects for all the covariates.

gamfit_ext <- gam(log(O3) ~ s(vh) + s(wind) + s(humidity) + s(temp) + s(ibh) + 
                    s(dpg) + s(ibt) + s(vis) + s(doy), data = ozone)
summary(gamfit_ext)

Family: gaussian 
Link function: identity 

Formula:
log(O3) ~ s(vh) + s(wind) + s(humidity) + s(temp) + s(ibh) + 
    s(dpg) + s(ibt) + s(vis) + s(doy)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.21297    0.01717   128.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F  p-value    
s(vh)       1.000  1.000 10.780  0.00115 ** 
s(wind)     1.021  1.040  8.980  0.00292 ** 
s(humidity) 2.406  3.025  2.567  0.05192 .  
s(temp)     3.801  4.740  4.161  0.00155 ** 
s(ibh)      2.774  3.393  5.341  0.00107 ** 
s(dpg)      3.285  4.176 14.268  < 2e-16 ***
s(ibt)      1.000  1.000  0.495  0.48225    
s(vis)      5.477  6.635  6.054 3.35e-06 ***
s(doy)      4.612  5.738 25.200  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.826   Deviance explained =   84%
GCV = 0.10569  Scale est. = 0.097247  n = 330
round(gamfit_ext$sp,4)
       s(vh)      s(wind)  s(humidity)      s(temp)       s(ibh)       s(dpg) 
4.669441e+08 1.759525e+03 1.666900e+00 3.009000e-01 1.148300e+00 7.541000e-01 
      s(ibt)       s(vis)       s(doy) 
6.936647e+08 1.372000e-01 1.618000e-01 

Generalized additive models (GAM)

Extending the model considering all the available covariates

Combining the summary output and the plots, we can consider linear effects for wind, vh and ibt.

par(mfrow=c(3,3))
for(j in 1 : 9) plot(gamfit_ext, select = j, scale = FALSE)

Generalized additive models (GAM)

Extending the model considering all the available covariates

Then, we fit the model considering

  • Extended model 2: Linear effects of wind, vh and ibt and smooth effects of the remaining variables.

  • Extended model 3: Only linear effects

# Extended model 2
gamfit_ext2 <- gam(log(O3) ~ vh + wind + s(humidity) + s(temp) + s(ibh) + 
                     s(dpg) + ibt + s(vis) + s(doy), data=ozone)
summary(gamfit_ext2)

Family: gaussian 
Link function: identity 

Formula:
log(O3) ~ vh + wind + s(humidity) + s(temp) + s(ibh) + s(dpg) + 
    ibt + s(vis) + s(doy)

Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)   
(Intercept) -6.0874893  2.4699664  -2.465  0.01427 * 
vh           0.0014501  0.0004384   3.308  0.00105 **
wind        -0.0311454  0.0101294  -3.075  0.00230 **
ibt          0.0006986  0.0010388   0.673  0.50177   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F  p-value    
s(humidity) 2.398  3.017  2.476 0.059093 .  
s(temp)     3.811  4.753  4.081 0.001778 ** 
s(ibh)      2.761  3.378  5.431 0.000953 ***
s(dpg)      3.354  4.259 14.320  < 2e-16 ***
s(vis)      4.559  5.627  6.572 3.91e-06 ***
s(doy)      4.571  5.692 25.618  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.826   Deviance explained = 83.9%
GCV = 0.10575  Scale est. = 0.097591  n = 330

Generalized additive models (GAM)

# Extended model 3 (GLM)
gamfit_ext3 <- gam(log(O3) ~ vh + wind + humidity + temp + ibh + 
                     dpg + ibt + vis + doy, data=ozone)
summary(gamfit_ext3)

Family: gaussian 
Link function: identity 

Formula:
log(O3) ~ vh + wind + humidity + temp + ibh + dpg + ibt + vis + 
    doy

Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.301e-01  2.684e+00   0.235  0.81453    
vh          -1.244e-05  4.908e-04  -0.025  0.97980    
wind        -8.403e-03  1.226e-02  -0.685  0.49357    
humidity     5.084e-03  1.713e-03   2.968  0.00322 ** 
temp         2.993e-02  4.527e-03   6.611 1.60e-10 ***
ibh         -8.495e-05  2.687e-05  -3.161  0.00172 ** 
dpg          5.470e-04  1.026e-03   0.533  0.59414    
ibt          4.902e-04  1.237e-03   0.396  0.69221    
vis         -8.395e-04  3.410e-04  -2.462  0.01434 *  
doy         -1.021e-03  2.522e-04  -4.049 6.47e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


R-sq.(adj) =  0.709   Deviance explained = 71.7%
GCV = 0.16808  Scale est. = 0.16298   n = 330

Generalized additive models (GAM)

Extending the model considering all the available covariates

Let’s compare the models by means of the AIC. From the AIC table, we can see the importance of including smooth effects for temp, dpg, vis and doy. Including smooth effects for humidity and ibh also improves performance, while smooth effects for vh, wind and ibt do not provide any benefit.

cbind(AIC(gamfit_ext,gamfit_ext2,gamfit_ext3))
                  df      AIC
gamfit_ext  27.37511 194.6941
gamfit_ext2 26.45411 195.0168
gamfit_ext3 11.00000 349.6878

Generalized additive models (GAM)

Selecting the smoothing parameter

The smoothing parameter (vector) \(\boldsymbol \lambda\) may be estimated by several methods, such as CV, AIC, BIC, etc. By default, the gam function uses method = “GCV.Cp” with scale = 0, meaning that the \(\lambda\)s are estimated via GCV when the scale parameter is unknown (as in the Gaussian case), and via UBRE (an AIC/Mallows’ Cp-like criterion) when it is known, as for the binomial or Poisson families. To understand the machinery, consider the following lines of code.

Generalized additive models (GAM)

sp <- gamfit_ext$sp # smoothing parameter obtained for the extended model 1

tuning.scale <- c(1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5)
scale.exponent <- log10(tuning.scale)
n.tuning <- length(tuning.scale)
min2ll <- aic <- bic <- gcv <- rep(NA, n.tuning)
for (i in 1:n.tuning) {
  gamfit_opt <- gam(log(O3) ~ s(vh) + s(wind) + s(humidity) + s(temp) + s(ibh) +
                      s(dpg) + s(ibt) + s(vis) + s(doy),
                    data = ozone, sp = tuning.scale[i] * sp)
  min2ll[i] <- -2 * logLik(gamfit_opt)
  aic[i] <- AIC(gamfit_opt)
  bic[i] <- BIC(gamfit_opt)
  gcv[i] <- gamfit_opt$gcv.ubre
}

par(mfrow = c(2, 2)) 
plot(scale.exponent, min2ll, type = "b",main = "-2 log likelihood")
plot(scale.exponent, aic, type = "b", main = "AIC")
plot(scale.exponent, bic, type = "b", main = "BIC")
plot(scale.exponent, gcv, type = "b", main = "GCV")
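Note that mgcv also offers other estimation criteria through the method argument of gam; for instance, REML-based selection. A sketch, refitting the same extended model:

```r
gamfit_reml <- gam(log(O3) ~ s(vh) + s(wind) + s(humidity) + s(temp) + s(ibh) +
                     s(dpg) + s(ibt) + s(vis) + s(doy),
                   data = ozone, method = "REML")
round(gamfit_reml$sp, 4)  # REML-estimated smoothing parameters
```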

Generalized additive models (GAM)

Selecting the smoothing parameter

Let’s make a comparison between the model selected by mgcv (using GCV) and the model that should be selected according to the BIC.

opt.tuning.scale <- tuning.scale[which.min(bic)]  # scale minimising the BIC
opt.sp <- opt.tuning.scale * sp                   # smoothing parameters selected via BIC

gamobj <- gam(log(O3) ~ s(vh) + s(wind) + s(humidity) + s(temp) + s(ibh) + 
                s(dpg) + s(ibt) + s(vis) + s(doy),
                data = ozone, sp = opt.sp)
summary(gamobj)

Family: gaussian 
Link function: identity 

Formula:
log(O3) ~ s(vh) + s(wind) + s(humidity) + s(temp) + s(ibh) + 
    s(dpg) + s(ibt) + s(vis) + s(doy)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.21297    0.01838   120.4   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F  p-value    
s(vh)       1.000  1.000  3.650   0.0570 .  
s(wind)     1.002  1.004  5.862   0.0161 *  
s(humidity) 1.369  1.646  2.349   0.0574 .  
s(temp)     2.098  2.649  9.676 2.43e-05 ***
s(ibh)      1.657  2.030  6.658   0.0012 ** 
s(dpg)      1.727  2.192 13.232 1.47e-06 ***
s(ibt)      1.000  1.000  0.037   0.8472    
s(vis)      3.171  3.965  7.033 2.51e-05 ***
s(doy)      2.462  3.196 27.265  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.801   Deviance explained =   81%
GCV = 0.11734  Scale est. = 0.11147   n = 330

Generalized additive models (GAM)

Selecting the smoothing parameter

Comparing the two fits, smoothing-parameter selection via GCV clearly outperforms the BIC-based selection here (lower AIC and higher deviance explained).

summary(gamfit_ext)

Family: gaussian 
Link function: identity 

Formula:
log(O3) ~ s(vh) + s(wind) + s(humidity) + s(temp) + s(ibh) + 
    s(dpg) + s(ibt) + s(vis) + s(doy)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.21297    0.01717   128.9   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
              edf Ref.df      F  p-value    
s(vh)       1.000  1.000 10.780  0.00115 ** 
s(wind)     1.021  1.040  8.980  0.00292 ** 
s(humidity) 2.406  3.025  2.567  0.05192 .  
s(temp)     3.801  4.740  4.161  0.00155 ** 
s(ibh)      2.774  3.393  5.341  0.00107 ** 
s(dpg)      3.285  4.176 14.268  < 2e-16 ***
s(ibt)      1.000  1.000  0.495  0.48225    
s(vis)      5.477  6.635  6.054 3.35e-06 ***
s(doy)      4.612  5.738 25.200  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.826   Deviance explained =   84%
GCV = 0.10569  Scale est. = 0.097247  n = 330
AIC(gamobj, gamfit_ext)
                 df      AIC
gamobj     17.48649 230.5532
gamfit_ext 27.37511 194.6941

Generalized additive models (GAM)

Model performance on a test set

Let’s split the data according to the following strategy:

  • We fit the extended models built above on the first eleven months (TRAIN)

  • We assess model performance on the last month (TEST) by means of the MSE

Of course, different strategies could be considered (the data have a time structure to respect). Alternatives:

  • Rolling-origin forecasting: the training set expands by fixed time blocks while the test set shifts forward in time. See https://otexts.com/fpp3/tscv.html for further details.

  • Instead of simple random sampling, use points with a fixed gap in time for the test set (e.g. time = 30, 35, 60, …), which reduces extrapolation.
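The rolling-origin idea can be sketched as follows (assumptions: the ozone data and mgcv are loaded as above, and the cut-points 270, 300, 330 are purely illustrative):

```r
cuts <- c(270, 300, 330)  # hypothetical expanding-origin cut-points (day of year)
mse_roll <- sapply(cuts, function(cut) {
  train <- ozone[ozone$doy <= cut, ]
  test  <- ozone[ozone$doy > cut & ozone$doy <= cut + 30, ]  # next 30-day block
  fit <- gam(log(O3) ~ s(temp) + s(vis) + s(doy) + wind, data = train)
  mean((log(test$O3) - predict(fit, newdata = test))^2)     # test MSE for this origin
})
mse_roll
```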

Generalized additive models (GAM)

library(MLmetrics)
library(glmnet)  # needed for cv.glmnet
ozoneTrain <- ozone[ozone$doy <= 360, ]
ozoneTest <- ozone[ozone$doy > 360, ]
y_train <- ozoneTrain$log_O3
y_test <- ozoneTest$log_O3
X_train <- as.matrix(ozoneTrain[, -c(1, 11)])  # drop the response columns (O3, log_O3)
X_test <- as.matrix(ozoneTest[, -c(1, 11)])
gamfit_ext <- gam(log(O3) ~ s(vh) + s(wind) + s(humidity) + s(temp) + s(ibh) + 
                    s(dpg) + s(ibt) + s(vis) + s(doy), data = ozoneTrain)
gamfit_ext2 <- gam(log(O3) ~ vh + wind + s(humidity) + s(temp) + s(ibh) + 
                     s(dpg) + ibt + s(vis) + s(doy), data = ozoneTrain)
gamfit_ext3 <- gam(log(O3) ~ vh + wind + humidity + temp + ibh + 
                     dpg + ibt + vis + doy, data = ozoneTrain)
cvfit_ridge <- cv.glmnet(X_train, y_train, alpha = 0)
cvfit_lasso <- cv.glmnet(X_train, y_train, alpha = 1)

pred_GAM1 <- predict(gamfit_ext, newdata = ozoneTest)
pred_GAM2 <- predict(gamfit_ext2, newdata = ozoneTest)
pred_GAM3 <- predict(gamfit_ext3, newdata = ozoneTest)
pred_ridgemin <- as.numeric(predict(cvfit_ridge, newx = X_test, s = "lambda.min"))
pred_lassomin <- as.numeric(predict(cvfit_lasso, newx = X_test, s = "lambda.min"))

MSE_GAM1 <- MSE(y_test, pred_GAM1)
MSE_GAM2 <- MSE(y_test, pred_GAM2)
MSE_GAM3 <- MSE(y_test, pred_GAM3)
MSE_ridgemin <- MSE(y_test, pred_ridgemin)
MSE_lassomin <- MSE(y_test, pred_lassomin)
c(MSE_GAM1, MSE_GAM2, MSE_GAM3, MSE_ridgemin,  MSE_lassomin)
[1] 0.1765936 0.1654977 0.3405155 0.3642804 0.4163413
plot(ozoneTest$doy, ozoneTest$log_O3, ylim = c(0,2.5))
points(ozoneTest$doy, pred_GAM2, col = "blue", pch = 20)
points(ozoneTest$doy, pred_GAM3, col = "red", pch = 21)
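As a sanity check, MLmetrics::MSE is simply the mean squared difference, so the same figure can be reproduced in base R (a sketch, reusing y_test and pred_GAM2 from above):

```r
mse_manual <- mean((y_test - pred_GAM2)^2)
all.equal(mse_manual, MSE_GAM2)  # TRUE
```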